Accurate Human Motion Capture and Modeling using Low-cost Sensors
Motion capture technologies, especially those that combine multiple kinds of sensors to capture both kinematic and dynamic information, are widely used in fields such as biomechanics, robotics, and healthcare. However, many existing systems are intrusive, restrictive, and expensive.
This dissertation explores two aspects of motion capture systems that are low-cost, non-intrusive, highly accurate, and easy for common users to operate: full-body kinematics and dynamics capture, and user-specific hand modeling.
More specifically, we present a new method for full-body motion capture that uses input data captured by three depth cameras and a pair of pressure-sensing shoes. Our system is appealing because it is fully automatic and can accurately reconstruct both full-body kinematic and dynamic data. We introduce a highly accurate tracking process that automatically reconstructs 3D skeletal poses using depth data, foot pressure data, and detailed full-body geometry. We also develop an efficient physics-based motion reconstruction algorithm for solving internal joint torques and contact forces based on contact pressure information and 3D poses from the kinematic tracking process.
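The dynamics stage admits a compact illustration. As a minimal sketch (not the dissertation's actual solver; the function and input names are hypothetical), once kinematic tracking yields joint accelerations and the pressure shoes yield contact forces, internal joint torques follow from the rigid-body equations of motion:

    import numpy as np

    def solve_joint_torques(M, c, J_c, f_contact, qdd):
        """Inverse dynamics for a single frame.

        Equation of motion: M(q) qdd + c(q, qd) = tau + J_c^T f_contact,
        so the internal joint torques are
            tau = M qdd + c - J_c^T f_contact.
        M:         (n, n) joint-space mass matrix
        c:         (n,)   Coriolis, centrifugal, and gravity terms
        J_c:       (m, n) contact Jacobian at the feet
        f_contact: (m,)   contact forces measured by the pressure shoes
        qdd:       (n,)   joint accelerations finite-differenced from tracked poses
        """
        return M @ qdd + c - J_c.T @ f_contact

Since tracked poses are noisy, a per-frame solve like this would in practice sit inside a least-squares reconstruction objective rather than be evaluated directly.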
In addition, we present a novel low-dimensional parametric model for 3D hand modeling and synthesis. We construct a low-dimensional parametric model to compactly represent hand shape variations across individuals and enhance it with Linear Blend Skinning (LBS) for pose deformation. We also introduce an efficient iterative approach to learn the parametric model from a large database of unaligned scans. Our model is compact, expressive, and produces a natural-looking LBS model for pose deformation, which enables a variety of applications ranging from user-specific hand modeling to skinning weight transfer and model-based hand tracking.
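The two ingredients named above, a low-dimensional shape space and LBS pose deformation, compose as follows. This is a generic numpy sketch under assumed array shapes and function names, not the dissertation's code:

    import numpy as np

    def hand_shape(beta, mean_shape, shape_basis):
        """Low-dimensional shape space: mean vertices plus a PCA-style basis
        weighted by per-user coefficients beta.
        Assumed shapes: mean_shape (3V,), shape_basis (3V, K), beta (K,)."""
        return (mean_shape + shape_basis @ beta).reshape(-1, 3)

    def lbs(vertices, skin_weights, joint_transforms):
        """Linear blend skinning: each vertex is a weighted blend of its
        position under every joint's rigid transform.
        vertices (V, 3), skin_weights (V, J), joint_transforms (J, 4, 4)."""
        v_h = np.hstack([vertices, np.ones((len(vertices), 1))])      # homogeneous coords
        per_joint = np.einsum('jab,vb->vja', joint_transforms, v_h)   # (V, J, 4)
        blended = np.einsum('vj,vja->va', skin_weights, per_joint)    # (V, 4)
        return blended[:, :3]

Learning the model then amounts to alternating between fitting these parameters to the scans and re-aligning the scan database, per the iterative approach described above.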
DIME-FM: DIstilling Multimodal and Efficient Foundation Models
Large Vision-Language Foundation Models (VLFMs), such as CLIP, ALIGN, and
Florence, are trained on large-scale datasets of image-caption pairs and
achieve superior transferability and robustness on downstream tasks, but they
are difficult to use in many practical applications due to their large size,
high latency, and fixed architectures. Unfortunately, recent work shows that
training a small custom VLFM for resource-limited applications is currently
very difficult using public, smaller-scale data. In this paper, we introduce a
new distillation mechanism (DIME-FM) that allows us to transfer the knowledge
contained in large VLFMs to smaller, customized foundation models using a
relatively small amount of inexpensive, unpaired images and sentences. We
transfer the knowledge from the pre-trained CLIP-ViT-L/14 model to a ViT-B/32
model, with only 40M public images and 28.4M unpaired public sentences. The
resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained
on its private WiT dataset (400M image-text pairs): Distill-ViT-B/32 achieves
similar results in terms of zero-shot and linear-probing performance on both
ImageNet and ELEVATER (a benchmark suite of 20 image classification tasks). It
also displays comparable robustness when evaluated on five datasets with
natural distribution shifts from ImageNet.
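The abstract leaves the distillation losses unspecified, so the following is only a plausible minimal sketch with hypothetical names: on unpaired data, a student can be trained to reproduce the teacher's image features and to match the teacher's image-to-text similarity distribution over a sentence bank, which transfers vision-language alignment without paired captions.

    import torch
    import torch.nn.functional as F

    def distillation_loss(s_img, t_img, s_txt, t_txt, tau=0.07):
        """Illustrative objective for distilling a VLFM on unpaired data.
        s_img/t_img: (B, D) student/teacher features for the same images;
        s_txt/t_txt: (N, D) student/teacher features for an unpaired sentence bank.
        """
        s_img, t_img = F.normalize(s_img, dim=-1), F.normalize(t_img, dim=-1)
        s_txt, t_txt = F.normalize(s_txt, dim=-1), F.normalize(t_txt, dim=-1)

        # 1) Match each student image feature to its teacher counterpart.
        feat_loss = (1 - (s_img * t_img).sum(-1)).mean()

        # 2) Match the distribution of image-to-text similarities, so the
        #    student inherits the teacher's vision-language alignment.
        t_logits = t_img @ t_txt.t() / tau
        s_logits = s_img @ s_txt.t() / tau
        kd_loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                           F.softmax(t_logits, dim=-1), reduction='batchmean')
        return feat_loss + kd_loss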
Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search
Recent work in network quantization has substantially reduced the time and
space complexity of neural network inference, enabling their deployment on
embedded and mobile devices with limited computational and memory resources.
However, existing quantization methods often represent all weights and
activations with the same precision (bit-width). In this paper, we explore a
new dimension of the design space: quantizing different layers with different
bit-widths. We formulate this problem as a neural architecture search problem
and propose a novel differentiable neural architecture search (DNAS) framework
to efficiently explore its exponential search space with gradient-based
optimization. Experiments show that we surpass state-of-the-art compression
results for ResNet on CIFAR-10 and ImageNet. Our quantized models, with a
21.1x smaller model size or 103.9x lower computational cost, still outperform
baseline quantized and even full-precision models.
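The key idea, relaxing the discrete per-layer bit-width choice so it can be searched by gradient descent, can be sketched in a few lines of PyTorch. This is a simplified stand-in for the paper's DNAS formulation (the straight-through estimator for rounding and the latency/size cost term are omitted):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def fake_quant(w, bits):
        """Uniform symmetric fake quantization of a tensor to `bits` bits.
        Note: round() has zero gradient; real implementations pair it with
        a straight-through estimator."""
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    class MixedPrecisionConv(nn.Module):
        """One searchable layer: a Gumbel-softmax mixture over candidate
        bit-widths, so the architecture parameters `alpha` are learned
        jointly with the weights by SGD."""
        def __init__(self, in_ch, out_ch, k=3, bit_choices=(2, 4, 8)):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
            self.bit_choices = bit_choices
            self.alpha = nn.Parameter(torch.zeros(len(bit_choices)))

        def forward(self, x, temp=1.0):
            probs = F.gumbel_softmax(self.alpha, tau=temp)   # soft one-hot over bit-widths
            w = sum(p * fake_quant(self.conv.weight, b)
                    for p, b in zip(probs, self.bit_choices))
            return F.conv2d(x, w, self.conv.bias, padding=self.conv.padding)

After the search converges, each layer keeps only the bit-width with the highest alpha, yielding a concrete mixed-precision network.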
Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference
Vision Transformers (ViTs) have shown impressive performance but still incur a
high computation cost compared to convolutional neural networks (CNNs); one
reason is that ViTs' attention measures global similarities and thus has
quadratic complexity in the number of input tokens. Existing efficient ViTs
adopt local attention (e.g., Swin) or linear attention (e.g., Performer),
sacrificing ViTs' ability to capture either global or
local context. In this work, we ask an important research question: Can ViTs
learn both global and local context while being more efficient during
inference? To this end, we propose a framework called Castling-ViT, which
trains ViTs using both linear-angular attention and masked softmax-based
quadratic attention, but switches to only linear-angular attention during
inference. Castling-ViT leverages angular kernels to measure the similarities
between queries and keys via spectral angles, and we further
simplify it with two techniques: (1) a novel linear-angular attention
mechanism: we decompose the angular kernels into linear terms and high-order
residuals, and only keep the linear terms; and (2) we adopt two parameterized
modules to approximate high-order residuals: a depthwise convolution and an
auxiliary masked softmax attention to help learn both global and local
information, where the masks for softmax attention are regularized to gradually
become zero and thus incur no overhead during ViT inference. Extensive
experiments and ablation studies on three tasks consistently validate the
effectiveness of the proposed Castling-ViT, e.g., achieving up to 1.8% higher
accuracy or a 40% MACs reduction on ImageNet classification, and 1.2 higher
mAP on COCO detection under comparable FLOPs, compared to ViTs with vanilla
softmax-based attention.
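For intuition on why dropping the high-order residuals yields linear complexity: with unit-normalized queries and keys, the angular similarity 1 - arccos(q.k)/pi expands as 1/2 + (q.k)/pi + residual, and with only this affine part the attention numerator and denominator can be computed by contracting over tokens first. The sketch below illustrates the trick generically; the paper's exact decomposition, constants, and normalization differ:

    import torch
    import torch.nn.functional as F

    def linear_angular_attention(q, k, v):
        """Attention with only the linear term of an angular kernel.
        q, k: (B, N, d); v: (B, N, d_v). Since sim = a + b * (q.k) is affine
        in q.k, the result costs O(N * d * d_v) instead of O(N^2)."""
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        n = k.shape[-2]
        a, b = 0.5, 1.0 / torch.pi

        kv = k.transpose(-2, -1) @ v                       # (B, d, d_v): contract tokens first
        num = a * v.sum(dim=-2, keepdim=True) + b * (q @ kv)
        den = a * n + b * (q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1))
        return num / den                                   # (B, N, d_v)

Because the similarity stays strictly positive for unit-norm inputs, the denominator is well behaved without a softmax.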
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
Open-vocabulary semantic segmentation aims to segment an image into semantic
regions according to text descriptions, which may not have been seen during
training. Recent two-stage methods first generate class-agnostic mask proposals
and then leverage pre-trained vision-language models, e.g., CLIP, to classify
masked regions. We identify the performance bottleneck of this paradigm to be
the pre-trained CLIP model, since it does not perform well on masked images. To
address this, we propose to finetune CLIP on a collection of masked image
regions and their corresponding text descriptions. We collect training data by
mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to
match masked image regions to nouns in the image captions. Compared with more
precise, manually annotated segmentation labels with fixed classes (e.g.,
COCO-Stuff), we find that our noisy but diverse dataset better retains
CLIP's generalization ability. Along with finetuning the entire model, we
utilize the "blank" areas in masked images using a method we dub mask prompt
tuning. Experiments demonstrate that mask prompt tuning brings significant
improvement without modifying any weights of CLIP, and that it can further
improve a fully finetuned model. In particular, when trained on COCO and
evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5%
higher than the previous state of the art. For the first time, open-vocabulary
generalist models match the performance of 2017 supervised specialist models
without dataset-specific adaptations.
Project page: https://jeff-liangf.github.io/projects/ovse
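Mask prompt tuning can be pictured concisely: the zeroed-out "blank" patches of a masked image carry no information, so their tokens are replaced with learnable prompts while CLIP's own weights stay frozen. A minimal sketch with assumed shapes, not the authors' implementation:

    import torch
    import torch.nn as nn

    class MaskPromptTuning(nn.Module):
        """Replace tokens of fully blank patches with learnable prompts
        before they enter CLIP's (frozen) image encoder."""
        def __init__(self, num_patches, dim):
            super().__init__()
            self.prompts = nn.Parameter(torch.zeros(num_patches, dim))

        def forward(self, patch_tokens, blank_mask):
            # patch_tokens: (B, P, D) from CLIP's patch embedding
            # blank_mask:   (B, P) boolean, True where the patch is blank
            m = blank_mask.unsqueeze(-1).to(patch_tokens.dtype)
            return patch_tokens * (1 - m) + self.prompts * m

Only `self.prompts` receives gradients, which is consistent with the observation above that mask prompt tuning improves results without modifying any weights of CLIP.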